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Abstract 

In  a  number  of  problems  in  computational  physics,  a  finite  sum  of  kernel  functions  centered  at  N 
particle  locations  located  in  a  box  in  three  dimensions  must  be  extended  by  imposing  periodic  boundary 
conditions  on  box  boundaries.  Even  though  the  finite  sum  can  be  efficiently  computed  via  fast  summa¬ 
tion  algorithms,  such  as  the  fast  multipole  method  (FMM),  the  periodized  extension  is  usually  treated 
via  a  different  algorithm,  Ewald  summation,  accelerated  via  the  fast  Fourier  transform  (FFT).  A  differ¬ 
ent  approach  to  compute  this  periodized  sum  just  using  a  blackbox  finite  fast  summation  algorithm  is 
presented  in  this  paper.  The  method  splits  the  periodized  sum  in  to  two  parts.  The  first,  comprising 
the  contribution  of  all  points  outside  a  large  sphere  enclosing  the  box,  and  some  of  its  neighbors,  is 
approximated  inside  the  box  by  a  collection  of  kernel  functions  (“sources”)  placed  on  the  surface  of  the 
sphere  or  using  an  expansion  in  terms  of  spectrally  convergent  local  basis  functions.  The  second  part, 
comprising  the  part  inside  the  sphere,  and  including  the  box  and  its  immediate  neighborhood,  is  treated 
via  available  summation  algorithms.  The  coefficients  of  the  sources  are  determined  by  least  squares 
collocation  of  the  periodicity  condition  of  the  total  potential,  imposed  on  a  circumspherical  surface  for 
the  box.  While  the  method  is  presented  in  general,  details  are  worked  out  for  the  case  of  evaluating 
electrostatic  potentials  and  forces.  Results  show  that  when  used  with  the  FMM,  the  periodized  sum  can 
be  computed  to  any  specified  accuracy,  at  a  cost  that  is  twice  that  of  the  free-space  FMM  with  the  same 
accuracy.  Several  technical  details  and  efficient  algorithms  for  auxiliary  computations  are  provided,  as 
are  numerical  comparisons. 
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1  Introduction 


Many  problems  in  physics,  chemistry  and  materials  science  lead  to  a  free-space  finite  “particle”  sum  of  N 
functions,  K ,  centered  at  locations  x,  G  <>0  C  M3,  where  Q0  is  a  rectangular  box  d\  x  d-2  x  d->,  centered  at 
the  origin  of  the  reference  frame 

N 

f  (y)  =  ^2  <uK  (y  -  xi)  •  (i) 

i= 1 

For  evaluation  at  N  locations  y,  this  sum  has  a  quadratic  cost.  There  arc  efficient  and  arbitrarily  accurate 
approximation  algorithms  for  this  summation  (e.g.,  the  fast  multipole  method,  FMM  [6]). 

Often,  an  extension  to  this  sum  for  0  must  be  computed  in  which  periodic  boundary  conditions  arc 
enforced  on  box  boundaries,  resulting  in  the  potential  f.  This  can  be  evaluated  by  replacing  the  sum  (1) 
with  the  infinite  sum 

N 

<f>{y)  =  X/ X! <llK  (y  -  x*  +  p)  ,  p  e  P  =  {(Mi, M2, M3) :  (*i,*2,*3)  e  ^3}  •  (2) 

p  i=  1 

For  some  functions  I\ ,  such  as  those  representing  the  field  of  an  electrostatic  charge,  this  infinite  sum  may 
be  divergent  or  conditionally  convergent.  In  this  case  certain  side  conditions  may  be  needed  to  compute 
a  physically  relevant  sum.  Usually  such  infinite  sums  arc  performed  using  Fourier-transform  based  Ewald 
summation  [1],  which  is  accelerated  via  the  FFT.  This  method  is  described  briefly  in  Appendix  C.  Account¬ 
ing  for  all  pairwise  interactions  the  method  can  achieve  0(N  log  N)  complexity,  for  N  particles  in  the  box 
Q0  which  is  periodically  replicated  over  the  full  space  [2,  3].  Because  of  the  technique  used  for  grid-to- 
particle  interpolation  these  methods  arc  usually  low-order.  A  high-order  accurate  Gaussian  interpolation 
based  Ewald  summation  algorithm  was  recently  presented  in  [4,  5]. 

A  criticism  of  FMM  algorithms  has  been  that  they  arc  relatively  harder  to  implement,  combining  the 
need  for  efficient  data  structures,  careful  analysis  and  computation  of  special  functions,  and  mixed  memory 
access  patterns.  Nevertheless,  several  open-source  and  commercial  packages  implementing  the  FMM  for 
standard  kernels  in  free  space  have  become  available.  The  FMM  is  not  often  used  in  practice  to  compute 
periodic  sums,  even  though  several  methods  to  handle  periodic  boundary  conditions  using  extensions  to  the 
basic  FMM  have  been  proposed,  starting  from  the  first  publication  of  the  algorithm  [6,  7,  8,  9,  10,  11,  12, 
13].  One  issue  with  these  extensions  is  that,  compared  to  the  basic  FMM  for  the  summation  of  free-space 
functions,  they  arc  more  expensive  and  may  require  a  more  complicated  algorithm  (data  structures  and  more 
complex  functions,  e.g.,  periodic  Green’s  functions).  Analysis  of  the  FMM  and  its  plane-wave  valiant, 
which  is  better-suited  tor  large  N  and  parallel  architectures  than  the  smooth  particle  mesh  Ewald  algorithm, 
is  presented  in  [14].  However,  all  these  methods  require  constructing  a  new  and  different  algorithm  -  a 
periodic  variant,  for  which  efficient  implementations  arc  not  in  general  available. 

Special  purpose  hardware  such  as  graphics  processors  or  heterogeneous  CPU/GPU  architectures  also 
allow  the  fast  computation  of  finite  sums,  either  via  brute  force  summation  [15],  or  via  the  mapping  of  the 
FMM  onto  these  architectures  [16,  17,  18,  19].  Yokota  et  al.  [19]  favorably  compare  a  large  scale  FMM- 
based  vortex  element  computations  with  a  direct  numerical  simulation  via  periodic  pseudospectral  methods. 
Their  simulations  could  have  been  faster  and  more  accurate  -  the  FMM  was  executed  on  a  finite  system 
composed  of  33  images,  which  while  not  being  truly  periodic  also  makes  using  the  FMM  significantly  more 
expensive. 

The  problem  this  paper  seeks  to  address  is:  Given  a  black-box  fast  summation  algorithm  (FSA)  the  user 
has  for  computing  finite  sums  with  a  given  kernel  K  (y  —  xt),  is  it  possible  to  compute  the  same  sum  with 
periodic  boundary  conditions  without  any  modification  of  the  FSA?  We  provide  a  positive  answer  to  this 
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question.  Our  algorithm  has  the  same  cost  as  the  FSA,  and  can  be  computed  to  any  user  specified  accuracy 
e,  and  does  not  use  the  FFT.  The  basic  idea  of  the  method  is  to  divide  the  sum  (2)  in  to  two  pa  its.  One  paid 
computes  a  finite  sum  of  particles  that  lie  within  a  sphere  centered  at  the  box.  This  is  computed  using  the 
available  FSA.  The  other  part,  is  an  approximation  of  the  field  within  the  box  due  to  all  particles  outside 
the  sphere.  The  field  due  to  these  sources  can  be  represented  within  the  box  in  terms  of  local  expansions. 
Such  local  expansions  have  also  been  proposed  in  other  attempts  to  extend  the  FMM  to  periodic  systems, 
but  arc  there  derived  by  explicit  translation  of  multipole  expansions  from  outside  the  box  of  interest  into 
it.  In  our  method,  we  propose  to  determine  the  coefficients  directly  from  the  periodicity  conditions  on 
the  potential,  which  results  in  solution  of  a  relatively  small  overdetermined  function-fitting  problem,  easily 
solved  via  standard  algorithms  -  e.g.,  rank-revealing  Q II  decomposition.  This  step  is  in  the  spirit  of  the 
“kernel-independent”  FMM  methods  [23,  24].  The  computational  overhead  of  our  method  is  of  the  order  of 
the  cost  of  the  finite  FMM  for  the  non-periodic  (free  space)  problem  with  N  particles  in  a  box,  and  overall 
the  periodic  sum  is  computable  for  about  twice  the  cost  of  the  finite  sum. 

We  present  this  “periodization”  approach  in  a  general  setting,  but  focus  computational  examples  on  the 
evaluation  of  the  electrostatic  potential  0  and  its  gradient  V <i>  at  M  evaluation  points  y  ;  G  Do  C  R3  due  to 
N  charged  particles  of  charge  q,  placed  in  Do,  and  subject  to  periodic  boundary  conditions  on  dflo-  This 
reduces  to  computing  the  infinite,  conditionally  convergent,  sum  (1),  with  kernel  function  K: 

1  N 

K  (y  -  x)  =  _  y  /  x;  K  (y  -  x)  =  0,  y  =  x,  ^  =  0.  (3) 

|y  x|  *=i 

and  the  net  charge  in  each  box  being  zero.  Of  course,  in  practical  applications,  such  as  in  molecular  dynam¬ 
ics,  there  will  be  other  computations  in  addition  to  the  one  discussed  here,  to  stabilize  the  overall  computa¬ 
tions.  This  paper  does  not  consider  these,  focusing  on  the  electrostatic  sum  at  a  single  time  step. 

The  periodization  method  could  be  easily  applied  to  other  kernels  for  which  a  “fast  summation  algo¬ 
rithm”  (FSA)  is  available.  As  long  as  the  periodic  sum  makes  sense,  and  if  the  kernel  K  can  be  expanded 
over  some  local  basis  the  method  should  work.  Conditions  similar  to  the  charge  neutrality  in  (3)  may  be 
necessary.  The  proposed  method  can  also  be  applied  for  2D  problems,  though  we  present  it  in  3D.  Also,  the 
periodic  extension  of  the  computational  domain  (box)  may  be  performed  only  along  one  or  two  coordinates, 
for  which  Ewald  summation  may  have  problems. 

2  Proposed  method 

2.1  Periodization 

The  box  on  which  the  periodic  sum  is  to  be  computed  is  denoted  Do-  Let  ,5’o  be  a  ball  of  radius  Rq  centered 
at  the  center  of  the  box  Do  and  containing  it.  Let  Sj,  be  another  ball  with  the  same  center  and  radius  Rf,  >  R0 
(see  Lig.  1).  We  denote  as  D/,  a  finite  region  which  includes  the  ball  .S'/,  (the  minimal  D/,  is  the  ball  S},).  We 
decompose  the  infinite  sum  as 

</>  (y)  =  </>  near  (y  )+<t>f  ar  (y)  5  finear  (y)=  5Z  qjK(y-Xj),  </>f 

ar  (y)  =  J2  vK  (y  _  xi)  ’  (4) 


where  4>near  (y)  is  to  be  computed  using  the  LSA  (assumed  to  be  the  LMM  in  the  sequel)  to  the  specified 
accuracy  e  while  0jar  (y)  is  computed  by  some  other  method  at  least  to  the  same  accuracy.  The  sources  in 
the  infinite  domain  are  indexed  as  x;  =  x,  —  p,  q,}  =  q,  for  appropriate  vectors  p  £  P  (see  Eq.  (2)). 

To  apply  the  LMM  to  the  kernel  function  K  (y)  it  should  be  possible  to  approximate  it  via  a  convergent 
series  over  a  set  of  local  basis  functions  {Rt  (y)}.  This  means  that  for  any  source  point  Xj  0  D/,  we  have 
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Figure  1:  The  particle  sum  (2)  in  the  box  Qq,  subject  to  periodic  boundary  conditions  on  the  box  boundary, 
is  to  be  computed  to  a  specified  accuracy  e.  The  boundary  condition  can  be  enforced  by  the  image  method, 
which  results  in  an  infinite  set  of  copies  of  the  box  Qq.  The  proposed  method  divides  the  sum  into  a  near-field 
component  in  Clb  and  a  far-field  sum. 


the  factorization 

p 

K(y  -xj)  =  '52Bt(xj)Rt(y)  +  e(jP\  |xj|  >  Rb,  |y|  <  Rq,  (5) 

t= 1 

where  P  is  the  number  of  terms  retained  in  the  infinite  series,  Bt  fxy )  are  the  expansion  coefficients,  and 

(P) 

(j  is  the  truncation  error  depending  on  x;;  and  Rq.  The  basis  functions  can  be  some  standard  choices,  or, 
as  in  the  kernel  independent  FMM  [24],  can  be  taken  to  be  a  collection  of  kernel  functions,  centered  outside 
the  region  of  approximation 

Rt  (y)  =  K  (y  -  x^s))  ,  x^a)  >  Rb,  (6) 

(s) 

where  xj  are  a  collection  of  sources  located  outside  ball  Sb.  This  method  has  the  flavor  of  “equivalent- 
source”  methods.  Since  the  function  K  depends  only  on  the  distance  between  its  argument,  it  is  a  “radial 
basis  function”,  or  RBF  [21],  While  any  approximation  scheme  may  be  used,  the  use  of  kernel  I\  as  RBF 
is  dictated  by  the  fact  that  this  kernel  satisfies  the  underlying  equation  (e.g.  the  Laplace  equation),  so  the 
approximation  to  the  field  via  the  sum  of  such  RBFs  also  satisfies  the  equation  (if  it  is  linear  and  space 
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(7) 


invariant).  Substituting  Eq.  (5)  into  the  expression  for  <Pjar  (y)  from  Eq.  (4),  we  obtain 


4>far(  y)  =  ]TctRt(y)  +  e(p\ 

t=  1 

ct  =  (x- 

b 


> 


e(F)  =  V  s<f  >. 


A  necessary  condition  for  our  method  is  convergence  of  both  infinite  sums  Ct  and  The  problem  of 
computation  of  (Pjar  (y)  has  been  reduced  to  that  of  determination  of  P  fitting  coefficients  Ct-  These  can  be 
determined  via  least-squares  collocation  as  follows.  Consider  a  set  of  L  >  P  check  points,  Y*C  c  ,S'o\Oo. 
A  point  y\r}  e  has  two  properties:first,  y)r)  <  R0,  and,  second,  that  there  exists  p  eP  such  that  point 
y|f7  =  y 77  _|_  p  ^  50  (sec  Fig.  1).  This  means  that 

</> (yfc))  =  </> (yjc)) ,  i  = 


(8) 


In  terms  of  decomposition  (4)  and  representation  of  the  far  field  (7)  this  system  can  be  rewritten  as 

p 

Y,AuCt  =  fi  +  e\P\  l  =  1, ...,  L,  (9) 


t=  t 


=  Rt  (yjc))  -  Rt  (y!c)) ,  fi  =  Kear  (y {c))  -  </w  (yfc)) 


where 


XP) 


{P)  -Pp)  (v[c) 


y  i 


yt  ^  2 maxyg50  (y)|  .  We  have  L  linear  equations  in  P 
unknowns  C\, ...,  Cp.  As  we  arc  not  constrained  with  the  size  of  the  set  of  the  set  points,  L  >  P  can  be 
selected  to  provide  a  substantial  oversampling,  so  minimization  of  functional 


F(C1,...,CP)  =  EE  AitCt-U  , 


(10) 


i=i  \t= t 


(  P) 

should  take  care  about  the  “noise”  introduced  into  the  approximation  due  to  ej  .  The  least  square  mini¬ 
mization  procedure  is  well  known  and  formally  it  results  in  solution 

C  =  Atf,  At  =  (AtA)_1  At,  (11) 

where  C  =  {Ct}  and  f  =  { f) }  are  organized  as  column  vectors  of  size  P  and  L,  respectively  and  At  is 
the  P  x  L  matrix,  which  is  the  pseudoinverse  of  A,  and  superscript  T  denotes  transposition.  Note  that 
this  notation  is  formal,  and  is  not  the  way  the  least-squares  problem  is  solved  in  practice.  Rather  a  stable 
algorithm  such  as  the  rank-revealing  QR  decomposition  [28]  is  used. 

The  known  coefficients  C  allow  computation  of  4>j:ar  (y)  and  can  be  added  to  the  4>near  obtained  via 
the  FSA.  However,  some  technical  details  need  to  be  specified.  In  the  next  section  we  provide  analysis  and 
details  for  the  important  case  of  the  Coulombic  kernel  (2). 

A  similar  collocation  of  kernel  based  RBF  expansions  at  a  relatively  small  amount  of  the  check  points 
is  also  used  in  the  “kernel  independent”  FMM  [23]  with  basis  functions  (6).  There,  the  collocation  is  at  the 
level  of  the  boxes  in  the  FMM  octree  data  structure,  and  fitting  takes  the  place  of  expansions  and  translations. 

(c) 

Here,  we  collocate  the  differences  of  the  overall  solution  at  a  set  of  check  points  y,  and  at  their  periodic 

~(c) 

images  yj  ,  at  the  level  of  the  overall  domain  to  determine  the  expansion  coefficients  for  4>far. 
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Figure  2:  The  check  point  distributions  over  the  surface  of  a  unit  sphere  for  p  =  16:random  distribution 
(L  =  3 p2  points),  the  Gauss  spherical  grid  (L  =  2 p2  —  p),  and  the  Thomson  points  (L  =  p2). 


2.2  Check  point  set 

Selection  of  an  optimal  set  of  check  points  Y(c)  C  So\^o  is  not  trivial.  A  good  check  point  set  should  yield 
a  well  conditioned  solution,  and  sample  the  solution  well  spatially.  A  simple  way  which  does  not  yield  such 
a  well  conditioned  set  is  to  select  the  points  yj'  c)  G  y(c)  on  the  boundary  of  box  Qq.  This  is  because  the 

~(c)  (c) 

corresponding  periodic  point  y,  will  then  also  be  located  on  the  box  boundary  on  the  face  opposite  to  y,  . 

(c)  ~(c) 

In  this  case  the  distances  from  the  box  center  to  y]  and  to  y,  will  be  the  same,  so  both  points  will  be 
located  on  a  sphere  of  radius  ry  .  If  the  basis  is  based  on  spherical  functions,  then  several  of  these  will  take 
the  same  value  for  symmetrical  points  on  the  sphere,  and  the  fitting  equations  may  be  rank  deficient. 

To  avoid  this  degeneracy,  the  set  was  chosen  from  points  on  the  surface  of  ball  So.  Three  distri¬ 
butions  were  tried  (see  Fig.  (2):  i)  random  uniformly  distributed  points,  ii)  Gaussian  nodes  (zeros  of  the 
Legendre  polynomial  along  9,  Pp  (cos  9),  see  [29])  and  equispaced  with  respect  to  ip),  and,  iii)  the  almost 
uniform  distribution  of  points  over  the  sphere  obtained  by  solving  the  so-called  Thomson  problem  of  the 
equilibrium  position  of  mutually  repelling  electrons  constrained  to  be  on  the  surface  of  the  sphere  [30],  see 
[31],  Methods  ii)  and  iii)  show  better  results  than  the  random  distribution,  as  shown  in  Section  3.  Further, 
we  fond  that  using  the  Thomson  points  for  the  interpolation  in  Eq.  6  provides  good  accuracy. 

2.3  Periodization  algorithm 

The  algorithm  has  two  parts.  In  a  first  preliminary  set-up  step,  denoted  “set”,  the  check  points  are  deter¬ 
mined,  and  the  matrix  decompositions  necessary  to  compute  the  least  squares  fit  with  the  matrix  A  in  Eq. 
(11)  are  precomputed.  In  simulations  where  the  domain  Qq  is  fixed  and  the  particles  move,  as  in  molecular 
dynamics,  this  matrix  does  not  change,  and  the  cost  of  the  “set”  step  is  amortized  over  the  entire  simula¬ 
tion.  The  second  paid  of  the  algorithm,  denoted  “get,”  computates  the  right  hand  side  and  solution  of  the 
fitting  equations  via  an  inexpensive  step  such  as  backsubstitution.  The  accuracy  depends  on  the  choice  of 
basis  functions,  and  the  parameters  P,L,  and  /(/,.  In  the  next  section  we  provide  both  a  theoretical  and  an 
empirical  study  for  the  case  of  the  Coulombic  kernel  (Green’s  function  of  Laplace’s  equation). 
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2.3.1  Algorithm  “set” 

1.  Set  the  circumsphere  radius  Rq  =  \\/d\  +  +  d|.  Based  on  the  required  accuracy  determine  Ri„ 

P ,  and  L  ^  P. 

(c) 

2.  Generate  L  check  points  distributed  over  the  surface  of  the  ball  So,  yl  ;  €  dSg,  l  =  1, L.  Denote 
this  point  set  as  Yc\. 

3.  For  each  point  yjc^  =  ■  yj^ ,  yj'^ )  ,1  =  1,  ...,L,  find  a  point  yj''1  €  So,  such  that  two  Carte- 

~(c)  (c) 

sian  coordinates  of  y)  arc  the  same  as  those  of  the  respective  coordinates  of  yj  ,  while  the  other 
coordinate,  yfi}  is  shifted  by  dp-  with  respect  to  yfi}.  Denote  Yc 2  =  jyj^  |  . 

4.  Form  the  L  x  P  fitting  matrix  A  =  {Au},  Alt  =  Rt  (yj'0)  ~  Rt  ( yjc) ),  l  =  1,  •••,  L,t  =  1, ...,  P, 
where  Rt  (y)  are  the  basis  functions  at  the  checkpoints. 


5.  Compute  matrix  decomposition  of  A  necessary  to  solve  the  least  squares  problem. 

6.  (optional)  Precompute  other  parameters  which  do  not  depend  on  the  source  distribution.  If  the  sum¬ 
mation  needs  certain  auxiliary  computations  to  ensure  convergence,  do  those  steps.  For  the  Coulomb 
kernels  this  may  involve  computation  of  the  integrals  of  the  basis  functions  over  the  box  Do. 


2.3.2  Algorithm  “get” 

1.  Periodically  extend  the  source  box  Do  to  cover  the  ball  Sb  of  radius  /?/,.  The  newly  generated  sources 
and  charges  will  have  coordinates  Xp  =  {x,  +  p}  for  the  values  of  the  periodization  vector  pi, ...,  p/, 
from  set  P  (see  Eq.  (2))  and  charges  Qp  =  { r/? }  .  Denote  the  set  of  all  sources  as  Xb  =  X 0  U  XPl  U 
...  U  XPb.  These  constitute  D&. 

2.  Find  the  set  of  JVj,  sources  Xnear  by  removing  sources  from  the  set  Xb  that  arc  outside  the  sphere  Sb, 
and  satisfy  |xj|  >  Rb  (Xnear  =  {x  EXj  :  |x|  ^  i?&}) . 

3.  Using  the  FSA  compute  4>near  for  a  given  set  of  evaluation  points  Yq  =  { y:j }  ,j  =  1, ...,  M  residing 
in  Do  and  belonging  to  sets  Y&  and  Yc 2,  i.e.  for  points  from  the  set  Y  =  {y  gYJj  U  Y  \  U  Yc 2}  .  If 
gradient  computations  arc  needed,  compute  X0near  at  y?  e  Yq. 

4.  Form  the  right  hand  side  of  the  periodization  equation  f  =  {_/)}  ,  fi  =  (j)near  (yj'  '  j  —  4>near  (yj'  !), 
l  =  1 , ....  Ij.  organized  in  a  column  vector. 

5.  Using  the  matrix  decompositions  in  the  “set”  step  solve  the  fitting  equations  for  the  P  expansion 
coefficients  C  =  (Ci, ...,  Cp)T  .  This  step  can  formally  be  written  as  C  =  AT. 

6.  (optional)  Compute  the  constant  shift  or  other  modification  of  the  far  field  potential,  if  needed. 

7.  Evaluate  0jar  (yj),  and  if  gradient  computation  is  needed,  Y(pjar  (yj),  at  y?  e  Yq. 

8.  Get  the  periodized  solution  of  the  problem,  0  (y j)  =  0near  (yj)  +  <t>far  (y j)  ■  y j  G  Yq.  and  if  gradient 
computation  is  needed,  V<j>  (y j)  =  V 4>near  (y j)  +  Vcpfar  (yj) ,  y,  €  Y0. 

Remark  1  In  molecular  dynamics  and  other  N-body  simulations  the  source  and  evaluation  points  are  the 
same,  so  Yq  =  Xq. 
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Remark  2  In  Step  5  for  the  Laplacian  kernel  and  spherical  basis  functions  C  =  (C2,  ■■■,  Cp)1 .  In  this  case 
optional  Step  6  provides  C\  (see  Eqs  (35)-(38)).  Otherwise  set  C\  =  0. 

Remark  3  Step  2  reduces  (l),  to  the  ball  Sb,  which  is  not  necessary,  but  is  efficient. 


2.3.3  Complexity 

Evaluation  of  the  kernel  at  a  single  point  requires  0(1)  operations.  So,  O(P)  operations  are  needed  to 
evaluate  all  basis  functions.  The  complexity,  O(P),  is  also  achieved  when  basis  functions  at  a  point  can  be 
computed  recursively  (as  for  the  spherical  basis  functions).  With  L  =  O(P)  we  can  estimate  the  complexity 
of  the  “set”  paid  of  the  algorithm  as 


C(set)  =  +  Q  (p)  +  +  O  (P)  +  O  ( P 3)  +  [0(P5/2)j  =  O  ( P 3)  , 


(12) 


where  we  assumed  that  computation  of  the  matrix  decomposition  (e.g.,  via  QR)  O  (P3)  operations,  as 
P  ~  L.  The  term  in  the  square  brackets  is  the  cost  of  the  optional  step  for  the  Laplace  kernel  and  spherical 
basis  functions,  using  the  method  presented  in  Appendix  A. 

For  the  “get”  paid  of  the  algorithm,  assuming  that  7Vb  =  O(N)  and  f’  C  T  we  have 

c^et)  =  0(N)+0{N)+C(fsa^+0  (P)+0{P2)+[0  (N)]+0  ( PN)+0(N )  =  O  (PN)+C^FSA\  (13) 

where  C(  FSA  ]  is  the  cost  of  the  finite  summation  algorithm  and  the  cost  of  the  optional  step  of  the  algorithm 
is  put  in  the  square  brackets  (see  Appendix  B  for  this  step  for  the  Laplacian  kernel).  This  cost  for  the  brute 
force  summation  is  0(N2),  while  if  the  FMM  is  used  as  the  FSA  it  can  be  estimated  as  follows. 

In  the  FMM,  generation  of  the  data  structure  for  M  ~  N  points  is  0(Nlm ax)  which  for  deep  trees, 
Imax  =  OilogN),  results  in  formal  0(N  log  N)  complexity.  However,  in  practice  the  depth  of  the  trees  in 
three  dimensions  is  relatively  small  (e.g.  Zmax  <  10)  for  sizes  N  <  10'  and  even  at  larger  Zmax  this  cost  is 
much  smaller  than  the  cost  of  the  run  paid  of  the  FMM.  Moreover,  the  translation  time  in  the  FMM  usually 
dominates  over  the  time  of  generation  and  evaluation  of  expansions  of  complexity  O(PfmmN),  where 
Pfmm  is  the  size  of  the  expansions  in  the  FMM.  For  optimal  Zmax  the  FMM  using  O  (  Pfmm)  methods 
for  translations  in  three  dimensions  scale  as  0{PfMMN)  or  0(PfMMN),  where  \  ^  a  ^  |  for  well 
studied  kernels  such  as  those  for  the  Laplace  and  Helmholtz  equations  (in  the  latter  case  additional  log”  N, 
/3  >  0  factors  appeal-  in  the  algorithm  complexity  ,  which  we  drop  in  the  present  estimate).  Optimized  kernel 
independent  FMM  has  complexity  0{PfmmN ).  This  can  be  summarized  as 

Ci9et)  =  O  (PfMMN)  +  O  (PN) ,  (14) 

Note  that  for  the  Laplacian  kernel  in  three  dimensions  truncation  numbers  p  =  P1/2  and  pfmm  = 
1  /2 

Pfmm  should  increase  as  Of  log  N)  for  a  fixed  absolute  7.^ -norm  error  (see  error  bounds  below),  while 
they  are  constant  for  the  relative  L2-norm  errors. 


3  Laplacian  kernel 

A  fundamental  feature  of  the  Laplace  equation  that  any  constant  is  a  solution.  Thus,  one  of  the  basis 
functions  is  a  constant  (say,  Pi  (y)  =  1).  Moreover,  any  constant  satisfies  periodic  boundary  conditions, 
and  so  this  paid  of  the  solution  cannot  be  determined,  as  Pi  (y)  belongs  to  the  null-space  of  the  Laplacian 

operator.  System  (9)  also  shows  that  An  =  Pi  (yj  —  Pi  (y^ ' )  =  0  for  any  l,  so  for  any  L  f  P  the 


rank  of  matrix  A  cannot  exceed  L  —  1,  in  which  case  the  coefficient  C\  can  be  arbitrary.  To  remove  this 
rank  deficiency  of  A  we  can  simply  remove  the  constant  basis  function  from  consideration,  formulate  the 
problem  as  the  problem  of  determination  of  coefficients  C)  for  t  =  2. ....  L  and  then  add  an  arbitrary  constant 
C\  to  the  solution.  Indeed,  in  many  cases  the  value  of  the  potential  is  not  important,  as  only  differences  and 
gradients  determine  physical  quantities  such  as  the  electric  field,  velocities,  forces,  etc.  For  comparison  with 
the  FFT-based  Poisson  equation  solutions  however,  it  is  desirable  to  obtain  C\ ,  in  which  case  an  additional 
condition,  zero  period  average,  needs  to  be  imposed.  These  will  be  discussed  in  a  separate  subsection.  Here 
we  just  mention  that  many  other  equations  have  the  same  problem  (e.g.  the  biharmonic  equation)  and  the 
null-space  of  some  equations,  such  as  the  Helmholtz  equation,  may  have  larger  dimension,  and  an  analysis 
similar  to  the  one  for  the  Laplace  equation  presented  here  would  be  needed. 


3.1  Spherical  basis  functions 

While  the  basis  functions  can  be  simply  selected  according  to  Eq.  (6),  we  instead  consider  the  closely  related 
polynomial  basis,  in  which  case  we  can  establish  error  bounds.  In  spherical  coordinates  (r,  9 ,  p)  related  to 
the  Cartesian  coordinates  via 


x  =  r  sin  9  cos  ip,  x  =  r  sin  9  sin  p,  z  =  rcos9, 


(15) 


the  local  and  multipole  solutions  of  the  Laplace  equation  in  3D  can  be  represented  as 


C(r)  =  anJ’"C(^)!  Sn  (r)  =  Pnr  n  1Y™(9,p),  71  =  0,1,...,  m  =  -n, ...,  n.  (16) 


Here  /(”'  (r)  arc  the  regular  (local)  spherical  basis  functions  and  S™  (r)  the  singular  (or  multipole)  spher¬ 
ical  basis  functions;  cr™  and  /)"'  arc  normalization  constants  which  can  be  selected  by  convenience,  and 
Y"'  (9.  p)  arc  the  orthonormal  spherical  harmonics: 


Y?  (9,  p)  =  N?pW  {p)eim^  /i  =  cos  9, 

=  /2n  +  l  {n-  \m\)\ 

n  [  ’  V  47T  (n  +  \m\)V 


n  =  0,1,2,...,  m  =  —  n,...,n, 


(17) 


where  P^  (//  )  arc  the  associated  Legendre  functions  [29].  We  will  use  the  definition  of  the  associated 
Legendre  function  P™  (/<)  that  is  consistent  with  the  value  on  the  cut  (—1, 1)  of  the  hypergeometric  function 
P™  (z)  (see  Abramowitz  &  Stegun,  [29]).  These  functions  can  be  obtained  from  the  Legendre  polynomials 
Pn  (//)  via  the  Rodrigues’  formula 


PZ  M  =  (-i)™  (i  -  m2) 


2\m/ 2  d‘ 


1  dn 

d^rn P -  ^  ’  Pn^)  =  Wn\dpP^2-^ 


(18) 


Straightforward  computation  of  these  basis  functions  involves  several  relatively  costly  operations  with  spe¬ 
cial  functions  and  use  of  the  spherical  coordinates.  Further,  as  defined  above,  these  functions  are  complex, 
which  is  an  unnecessary  expense  for  real  valued  computations.  In  [16]  real  basis  functions  were  defined  as 


~m=[  R e{R™},  m>  0  =  f  Re{5™},  m>  0 

\  m<  O’  n  \  IrnjS'™},  m<0 

with  /(”'  and  S'™  defined  via  Eq.  (16)  are 


CL 


=  (-1)’ 


4vr 


(2 n  +  1)  (n  —  m)\(n  +  m)V 


ft 


n  =  0,1,....,  m  =  —n,...,n. 


—  m)\(n  +  m)\ 
2n  +  1 


(19) 


(20) 
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Only  local  basis  functions  arc  needed  here,  and  can  be  computed  via  an  efficient  recursive  process,  without 
spherical  coordinates, 


*0=1,  R'1=\y' 

,  ,  (xR\n\-\+yR7^1 

pH  _  V  \m\-i  y  H-i 

M“  2  I  ml 


r>~\m\  _ 

\>n\  ~ 


—  |m|+l 
\m\— 1 


2  Iml 


r>m  _  „  r?m 


m  =  0,  ±1, 


iC  =  - 


(2n-l)^_1  +  r^_2 
(n  —  |m|)  (n  +  |m|) 


(21) 


|m|  =  2,  3, ... 


n  =  \m\  +  2, ....,  m  =  —n,...,n. 


While  this  basis  is  good  for  the  FMM,  the  matrix  A  for  fitting  (j) jnr  was  found  to  be  poorly  conditioned, 
because  the  functions  decay  strongly.  We  fix  this  problem  using  the  following  renormalized  basis 

R™  =  \JJn  —  m)\{n  +  m)\R n  =  0, 1,....,  m  =  —  n,  (22) 

This  basis  can  be  obtained  in  the  same  manner  as  |  l?,"1 1  was  from  the  complex  basis  R™',  where  a™  = 

(— l)n  \/ /  (2n  +  1) .  Complex  basis  functions  /('''  with  a  similar  normalization  a™  (without  the  factor 
(— l)n)  were  used  in  [33].  Note  further  that  p-truncated  expansion  of  a  harmonic  function  (j)jar  over  basis 
(22)  can  be  written  as 

p—ln  P 

4>far  (y)  =  E  E  C™R™(y)  =  J2CtRt(y),  Ct  =  c™,  Rt(y)  =  R™(y),  P  =  p\ 

n=0m=—n  t= 1 

t  =  (n  +  l)2  —  (n  —  m),  n  =  0, 1,  —  1,  m  =  —n,...,n.  (23) 


The  latter  form  of  the  sum,  where  stacking  of  coefficients  is  used,  is  consistent  with  (7).  The  gradient  of  the 
potential  is  needed  to  compute  the  force.  This  can  be  computed  as 


V</>/ar  (y)  = 


p—  1  n  P 

£  £  E”7C(y)  =  £Etft(y). 


n=l  m=—n 


t= 2 


(24) 


where  E"'  arc  vectors  in  M3.  In  [35]  one  can  find  relations  between  the  potential  and  gradient  coefficients. 


3.2  Error  bounds 


For  the  Laplacian  kernel  K,  Eq.  (2),  p-truncated  expansions  (5)  over  the  basis  { /?”'■ } ,  Eq.  (16),  has  a 
well-known  error  bound 


< 


*o_v 


p 


=  pi/2. 


|xjl  -*o  VM7 

The  expansion  error  (7)  due  to  all  sources  located  in  £3\(£  then  can  be  bounded  as 

Ro^p 


,(p) 


< 


< 


max  gj 

( Ro^ ; 

r< 

Rh  —  Ro  ~L 

VM, 

max  ®  r 

n  (x) 

b 

7*0 

Rb  —  Ro  J R3/q( 

\  r 

E 

xj)> 

r  = 

XjGM3/^b 


max  \  qi\ 


dV, 


£ 


XjGMP/Clb 


X) 


(25) 


(26) 
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Here  we  introduced  the  the  number  density  n(x),  which  for  integral  estimates  can  be  replaced  with  a 
constant  density  no  =  TV/ Vq,  where  Vo  is  the  volume  of  box  Ho,  Vo  =  d\  d^dp.  In  this  case  the  integral  can 
be  evaluated  as 


[  „  (*)  (*>  V  dV  ~  4jrno  f  («>)’ Pir  =  A 

TR3/ni)  V  r  y  jRb  \  r  J  P-3 


Rb 

Ro 


Hence,  we  have  an  approximate  error  bound 


AP) 


<  VrnoRl  max\qj\ ^_p 


u  Rb  -  Ro  p~  3 

For  a  cubic  domain  we  have  d\  =  cfe  =  ^3  =  d,  Rq  =  \d^/ 3,  Rb  =  A-Rq.  and  we  get 


AP) 


3nN  max  |  %  | 


1 


d  (A  -  1  )(p-  3)  Xp 


-3  ' 


(27) 


(28) 


(29) 


The  actual  error  achieved  in  practice  is  expected  to  be  much  smaller  than  this  estimate  since  it  neglects 
cancellation  effects  due  to  the  total  charge  neutrality.  Also  in  the  above  equation  one  can  set  d  =  1  to  obtain 
a  non-dimensional  measure  of  the  absolute  error  (since  the  Laplace  equation  is  scale  independent). 


3.3  Optimization  when  using  the  FMM 

There  are  three  free  parameters,  p,  pfmm,  and  A,  which  can  be  selected  to  optimize  algorithm  performance. 
Assuming  that  the  number  of  charges,  their  intensities  and  distribution  as  well  as  the  domain  Ho  are  fixed, 
and  computations  performed  with  some  prescribed  tolerance,  e,  the  optimization  problem  can  be  formulated 
as 


ep  ( P ,  A)  =  e,  €fmm  (pfmm,  A)  =  e,  C<'gct'1  ( P,Pfmm ,  A)  -A  min,  (30) 

where  ep  and  cfmm  are  the  error  bounds  for  the  periodization  and  the  FMM  respectively,  while  C'VVi  js  the 
cost  of  the  “get”  step.  In  practice  these  functions  should  be  determined  experimentally,  using  the  qualitative 
theoretical  estimates  provided  below  for  guidance. 

We  set  the  parameter  pfmm  by  the  prescribed  accuracy  e,  and  approximate  ep  (p.  X)  as 


ep(p,\)  =  BP\-p, 


In  (BP/ e ) 
In  A 


(31) 


where  Bp  is  some  constant.  The  number  of  evaluation  points  is  Mb  =  Ar  +  2L,  while  the  number  of  sources 
to  be  summed  are  TV&  =  O  (X‘Nj .  For  a  perfectly  optimized  FMM,  in  which  the  cost  of  translations  and 
direct  summations  dominates  the  cost  of  other  procedures,  the  complexity  is  proportional  to  the  geometric 
mean  of  the  number  of  sources  and  evaluation  points.  Assuming  L  ~  2 p2  the  cost  can  be  estimated  as 

C{FMM)  =  o  =  o  (pj?MMN\V2  (i  +  7  ^ 

=  Afmm  p2FaMMNX3/2  (l  +  Aln7jf)  7  •  l<a<l,  (32) 

where  Afmm  is  some  constant.  Of  course,  for  N  3>  p2  the  second  term  in  the  parentheses  can  be  dropped. 
However,  for  A  -A  1,  we  may  have  p2  ~  N,  and  as  soon  as  the  complexity  is  a  product  of  the  decreasing 
and  increasing  functions,  some  minimum  is  expected.  Our  tests  show  that  the  cost  of  the  FMM  is  the  major 
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contributor  to  the  overall  cost  of  the  algorithm.  Taking  the  derivative  of  C(FMM'1  with  respect  to  A  and 
setting  it  to  zero,  we  obtain  \opt  from  the  solution  of  the  cubic  equation. 


2x3  —  x2  —  A  =  0,  x  = 


In  A, 


A  = 


N 


opt 


4  In2  (Bp/e) 


At  very  large  A  (4  >  1)  we  have 

A  opt  ~  exp 


2 


=  exp 


N 

8  In2  l Bp/e ) 


-1/3' 


which  shows  that  at  large  N  optimal  A  should  be  shifted  towards  the  limit  A  =  1 . 


(33) 


(34) 


3.4  Constant  shift  in  potential 

Several  methods  can  be  proposed  to  determine  coefficient  C\  if  it  is  needed.  In  particular-  because  we  choose 
to  compare  our  results  with  the  Ewald  summation  method,  we  need  it.  Of  course,  the  simplest  case  is  that 
when  the  potential  value,  c/>0,  is  prescribed  or  known  at  some  point  yo,  in  which  case 

p 

Cl  =  <f>0-  (pnear  (Yo)  -  CtRt  fo)  -  (35) 

t= 2 

Note  then  that  the  Fourier  based  methods  for  periodization  of  Green’s  function  produce  solution  with 
some  mean  of  the  potential  4>mean  (since  the  zero  mode  of  the  Fourier  transform  is  zeroed).  This  particular 
solution  corresponds  to 

=  TF  [  <t>  (y)  dV  (y)  =  Kean-  (36) 

ho  Jil0 

In  this  case  for  consistency  we  should  set 


Cl  =  Kean  ~ /r  [  Kar  (y )  dV  (y)  -  £  ^ [  Rt  (y )  dV  (y) .  (37) 

ho  Jn0  ho  Jq0 

As  shown  in  Appendix  C,  the  Ewald  summation  produces  4>mean  =  0,  and  we  do  the  same  for  our  method. 
Integrals  7?|°'1  can  be  computed  relatively  easy,  since  Rt  (y)  are  polynomials  in  the  Cartesian  coordinates 
of  y  of  degree  which  does  not  exceed  p  —  1,  for  which  case  exact  quadratures  exist.  In  fact,  for  a  given 
box  size  ratio  (e.g.  for  cube)  these  can  be  precomputed,  scaled  and  used  independently  of  particular  source 
distribution  (see  Appendix  A).  The  first  integral  can  be  represented  as  a  sum 


1 

Vb 


& near  (y)  dV  (y) 


H  Qj® o(xj), 

xjen6 


4>o  (x) 


x)  dV  (y). 


(38) 


In  Appendix  B  we  provide  analytical  expressions  for  functions  <I>o  (x).  Despite  their  unwieldiness,  their 
computation  for  a  given  x  is  0(1).  Overall  it  is  a  O  (N)  procedure  to  compute  the  sum  and  constant  C\, 
which  is  consistent  with  the  overall  complexity  of  the  method.  There  also  exist  symmetries  for  periodic 
location  of  sources,  which  can  be  used  to  accelerate  these  computations,  if  this  becomes  an  issue.  Note  also 
that  results  of  Appendix  B  can  be  applied  for  computation  of  integrals  representing  the  far  field  (37)  in  the 
case  when  the  RBF,  Eq.  (6),  is  used. 
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4  Numerical  tests 


To  check  the  accuracy  and  performance  of  the  method  we  conducted  several  numerical  tests.  There  arc  very 
few  known  analytical  solutions,  so  for  comparison  we  also  implemented  and  tested  a  simple  version  of  the 
Ewald  summation  method  as  an  alternative  method  (see  Appendix  C). 

4.1  Small  size  tests 

As  validation,  we  performed  tests  with  different  number  of  sources  in  the  box.  First,  we  conducted  a 
small  size  test,  with  a  cubic  domain  Oq  and  eight  sources  of  charges  qt  =  ±1  located  at  the  vertices,  so 
that  neighboring  sources  have  opposite  charges  and  the  infinite  domain  forms  a  regular  equispaced  grid. 
Physically  this  corresponds  to  crystal  structures,  such  as  formed  by  molecules  NaCl.  As  the  reference  for 
accuracy  tests  we  computed  the  Madelung  constant  for  this  crystal  [34], 

N 

Ma^a  =  -Mac  1  =  4>  (yNa)  =  -^NaCl  EE  QiK  (yNa  -  Xj  +  p)  (39) 

p  i=  1 

— - - *  =  -1.74756459463318219..., 

+  k 2  +  l2)2 

where  ypja  is  the  location  of  Na  atom  and  i?NaCi  is  the  distance  between  the  closest  neighbor  atoms  (in  the 
tests  we  used  d\  =  d-2  =  =  1,  in  which  case  i?NaCi  =  0.5).  We  also  computed  this  constant  using  the 

Ewald  summation  and  compared  spatial  distributions  of  the  potential.  As  the  measures  of  the  relative  errors 
we  used 


1  M 

(40) 

i= 1 

where  y(  6  Qq  are  the  receivers  located  on  the  grid  used  for  the  Ewald  summation. 

For  the  high  accuracy  test,  we  selected  a  44  x  44  x  44  grid,  £  =  12,  and  the  sampling  neighborhood  for 
each  source  Nr  =  20  for  the  Ewald  method  (see  Appendix  C).  This  setting  provides  cm  ~  I  (r  1 4  (i.e.  14 
digits  of  the  Madelung  constant).  High  accuracy  test  for  the  present  method  was  performed  with  p  =  35, 
Rb  =  1.5  (A  =  Rb/Ro  =  \/3),  which  resullts  in  errors  cm  ~  6  •  10“  14  and  eq  ~  9  ■  10  l:!.  For  the  middle 
accuracy  test  we  used  24  x  24  x  24  grid,  £  =  10,  and  the  sampling  neighborhood  for  each  source  Nr  =  10, 
in  which  case  the  Ewald  method  results  in  eMa  ~  6  •  10-9.  In  our  method  we  used  p  =  16,  Rb  =  1.5, 
which  produced  errors  6m  ~  7  •  10  8  and  £2  ~  10“6.  These  tests  show  that  errors  cm  and  £2  are  related  and 
the  former  one  approximately  one  order  of  magnitude  smaller  than  the  latter.  So  in  the  following  accuracy 
tests  we  measured  only  6m  for  our  method,  which  is  independent  of  the  Ewald  summation  routine.  These 

(c) 

computations  were  performed  for  the  check  points  y)  distributed  on  the  Gauss  spherical  grid. 

Figure  3  shows  the  dependence  of  cm  computed  for  65 1  values  of  parameters  A  =  Rb/Ro  and  p  control¬ 
ling  the  accuracy.  The  chart  on  the  right  shows  that  the  computational  errors  arc  consistent  with  theoretical 
error  bound  eth.  =  Ce  ( Ro/Rb)p ■  For  very  small  values  of  eth  the  computational  errors  arc  affected  by  the 
double  precision  roundoff  errors.  This  shows  that  the  parameters  can  be  set  to  achieve  the  required  accuracy. 

Table  1  shows  some  results  of  the  tests  with  different  distributions  of  the  check  points.  Here  for  the  case 
of  random  distributions  for  any  set  size  we  performed  100  runs  and  the  maximum  error  is  reported.  It  is 
seen  that  the  lowest  errors  were  achieved  using  the  Thomson  point  distributions.  The  number  of  such  points 
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P  Theoretical  error 


Figure  3:  The  relative  error,  em,  in  computations  of  the  Madelung  constant  for  NaCl  crystal  using  the  present 
method  (colors).  The  chart  on  the  graph  compares  the  theoretical  computational  error,  eth  =  Ce  ( / Ri,)p 
with  the  actual  error  for  all  data  points  used  to  plot  the  chart  on  the  left.  Ideally,  the  graph  on  the  right  should 
be  a  straight  line  (shown).  Constant  Ce  was  set  as  em / t-th  for  p  =  10  and  A  =  Rb/Ro  =  1.5. 


Table  1 :  Error  em  for  different  check  point  distributions  at  A  =  Rb/Ro  =  v/3. 


p 

Gauss  sph.  grid 

T(256) 

T(400) 

Rand./.  =  2  p2 

Rand.L  =  3 p2 

Rand.L  =  5  p2 

8 

4.85(-6) 

3.87(-6) 

3.90(-6) 

3.39(-5) 

2.21(-5) 

1.92-(5) 

12 

1.17(-7) 

4.60(-8) 

6.28(-8) 

1.53(-6) 

1.04(-6) 

5.10(-7) 

16 

7.16(-8) 

3.43(-7) 

2.74(-8) 

6.26(-7) 

3.50(-7) 

1.85(-7) 

should  be  not  less  than  p2  —  1,  which  is  necessary  (but  not  sufficient)  condition  for  the  use  of  the  present 
method.  When  the  number  of  check  points  approaches  p2  —  1  the  accuracy  of  the  method  deteriorates.  Con¬ 
clusion  here  is  that  if  a  database  of  the  Thomson  points  or  some  analogous  method  of  deterministic  uniform 
distribution  of  the  check  points  exsist,  then  that  method  is  recommended.  In  fact,  the  Gauss  spherical  grid 
also  provides  good  results  (the  order  of  the  error  is  the  same).  This  grid  is  easy  to  generate  for  any  p,  and 
that  is  why  this  was  used  in  the  tests.  The  error  for  random  distributions  is  about  one  order  of  magnitude 
larger  than  that  for  the  Gauss  spherical  grid  or  for  the  Thomson  points.  It  slowly  decays  with  the  growing 
oversampling.  Perhaps,  there  is  no  reason  to  use  random  sets,  which  anyway  show  strong  dependence  of  the 
emor  on  p  and  also  can  be  used  if  needed. 

We  also  performed  the  accuracy  test  for  different  basis  functions  (RBF,  Eq.  (6)),  where  256  Thomson 

(s) 

sampling  sources  x)  '  were  located  on  ball  Si„  while  the  check  points  were  the  same  as  for  the  last  line  of 
Table  1  (p  =  16).  In  this  case  we  obtained  em  ~  1.32  •  10  ,  which  is  approximately  two  times  larger  than 

the  error  when  using  the  spherical  basis  functions. 

4.2  Large  scale  tests 

The  large  scale  tests  were  conducted  for  systems  with  N  up  to  223  ~  107,  for  which  ()(N  log  N )  summation 
algorithms  are  needed.  The  main  puipose  of  these  tests  is  to  check  the  performance  and  scaling  of  the  present 
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algorithm.  The  reported  wall  clock  times  were  measured  on  an  Intel  QX6780  (2.8  GHz)  4  core  PC  with  8 
GB  RAM  and  averaged  over  ten  runs  of  the  same  case. 

We  used  a  well-tested  standard  version  of  the  FMM  for  the  Laplacian  kernel  in  three  dimensions,  where 
all  translations  are  performed  with  O  {tPfmm)  complexity  using  the  rotation-coaxial  translation-back  ro¬ 
tation  (RCR)  algorithm.  The  code  implements  a  standard  multipole-to-local  translation  stencil  with  the 
maximum  189  neighborhood  (see  details  in  [36]).  The  code  was  parallelized  for  4  core  CPU  machine  using 
OpenMP  with  parallelization  efficiency  close  to  100%.  While  faster  versions  of  the  FMM  were  available  to 
the  researchers  (say  utilizing  graphics  processors,  [16]),  the  version  used  for  the  tests  was  selected  to  provide 
consistent  scaling  of  different  algorithm  parts,  as  the  periodization  algorithm  was  implemented  on  multicore 
CPUs.  For  the  tests  we  treated  the  FMM  as  a  black  box  FSA  and  used  it  “as  is”  without  any  modifications. 

First  we  ran  accuracy  tests,  when  the  charges  have  random  intensity,  ±1,  and  zero  total  sum  and  are 
located  inside  a  unit  cube  at  regularly  spaced  grid  points  (subgrids  of  60  x  60  x  60  grid).  The  potential 
at  charge  locations  (reference  solution)  was  computed  with  high  accuracy  using  the  Ewald  method  (grid 
60  X  60  X  60,  £  =  15,  Nr  =  24).  The  present  method  with  p  =  40  and  pfmm  =  30  for  this  case  showed 
error  62  ~  1.2  •  10”10.  Further  performance  tests  were  conducted  with  lower  tolerance  to  ensure  that  the 
eiTor  of  the  reference  solution  does  not  affect  error  and  optimization  studies.  In  all  cases  the  Gauss  spherical 
grid  ( L  =  2p2  —  p)  was  used  for  the  check  points. 


slice  Z  =  0 


Figure  4:  Comparison  of  periodic  solutions  at  the  median  plane  z  =  0,  obtained  by  different  methods  for 
30  x  30  x  30  sources  of  random  intensity  g*  =  ±1  in  a  unit  box.  The  top  row  shows  respectively  the  periodic 
solution  obtain  by  the  Ewald  method,  the  non-periodic  solution,  and  the  difference  between  the  former  and 
latter  potentials.  The  bottom  row  shows  the  periodic  solution  obtained  by  the  method  proposed  in  this  paper, 
and  its  near  and  far  field  components  for  A  =  1.1.  Computations  performed  with  p  =  80,  pfmm  =  16. 


Figure  4  illustrates  distribution  of  potentials  generated  by  27,000  charges  in  a  box  computed  by  two 
different  methods  sampled  on  60  x  60  x  60  grid.  It  is  seen  that  periodic  solutions  obtained  using  the  Ewald 
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method  and  present  method  arc  almost  the  same  (€2  <  5  •  1CT7),  while  substantially  different  from  the 
non-periodic  solution  (the  free  field  generated  by  the  same  sources).  This  also  shows  that  accounting  for 
the  near  field  (sources  in  fij,)  substantially  improves  the  non-periodic  solution,  but  the  far  field  component 
is  still  of  magnitude  comparable  with  the  near  field.  This  far  field  is  smooth  and  its  addition  to  the  near  field 
results  in  an  accurate  periodic  solution. 


Table  2:  Performance  for  different  A  for  tolerance  etol,2  =  5-10  7  at  N  =  27,  000,  pfmm  =  16. 


A 

Pmin 

T^9et\ s 

T\get)  ,,s 

ji(FMM)  s 

Nb 

N  +  2L 

T^set\ s 

1.1 

80 

2.81 

2.64 

2.03 

97,780 

52,440 

150 

1.2 

44 

1.95 

1.73 

1.58 

126,744 

34,656 

5.31 

1.3 

34 

2.03 

1.78 

1.66 

161,072 

31,556 

1.27 

1.5 

24 

2.21 

1.77 

1.69 

247,742 

29,256 

0.29 

1.7 

19 

2.53 

1.98 

1.90 

360,640 

28,406 

0.13 

As  the  theory  predicts  existence  of  an  optimal  value  of  parameter  A  for  a  given  tolerance  we  conducted 
tests  to  determine  this  value  experimentally.  Tables  2  and  3  display  the  results  of  these  tests.  Here  we 
computed  potential  alone  (no  gradient  computations).  The  difference  between  the  cases  shown  in  the  tables 
is  in  the  number  of  charges  (27,000  and  216,000,  respectively).  In  these  tests  pfmm  =  16,  which  provided 
the  relative  L2-norm  error  of  the  FMM  itself  smaller  than  tolerance  etoi,2-  Since  the  truncation  number 
p  changes  discretely,  there  is  the  minimal  integer  p  =  pm\n  at  which  €2  <  £2, tot,  where  £2  is  defined  by 
Eq.  (40).  This  p  is  slightly  depends  on  N  and  is  shown  in  the  tables.  The  periodization  algorithm  was 
executed  with  this  p.  The  tables  also  show  the  number  of  sources,  A),,  and  the  total  number  of  evaluation 
points,  N  +  2 L.  These  numbers  provide  data  of  the  size  of  the  problem  solved  by  the  FMM.  It  is  seen 
that  the  FMM  execution  time  is  a  non-monotonic  function  of  A,  as  at  the  increasing  A  we  have  increasing 
Nb  =  O  (iVA3)  and  decreasing  L  ~  -Pm\n-  As  the  theoretical  complexity  of  the  optimized  FMM  is 
proportional  to  \jN~b  (N  +  2 L)  there  can  be  a  minimum  of  this  function,  which,  indeed,  is  realized  in 
experiments  shown  in  Table  2.  This  is  not  the  case  for  data  from  Table  3,  but  the  explanation  can  be 
that  optimization  of  the  FMM  (the  maximum  depth  of  the  octree)  changes  discretely,  so  the  performance 
depends  on  the  source  and  receiver  distribution.  It  is  also  noticeable  that  despite  of  substantial  change  of  Nb 
the  FMM  time  does  not  change  significantly.  Finally,  the  time  of  the  “set”  part  of  the  algorithm  increases 
dramatically  at  large  p  (as  j>(>).  However,  for  a  given  p  and  computational  domain  the  pseudoniverse  matrix 
can  be  precomputed  and  stored  independently  on  the  number  of  charges  and  their  distribution.  So  this  should 
not  be  considered  as  a  limiting  factor.  These  tests  bring  us  to  conclusion  that  practical  optimal  A  are  in  the 
range  ~  1.2  —  1.5.  Smaller  or  larger  A  can  be  also  used  based  on  particular  problems  and  other  issues  (e.g. 
memory  complexity). 

Effect  of  the  basis  functions  on  the  algorithm  performance  is  shown  in  Table  4.  As  in  the  small  size  tests 
256  Thomson  points  were  selected  as  the  centers  of  the  RBFs.  Comparison  of  the  performance  obtained 
using  the  RBFs  vs  the  spherical  basis  functions,  show  that  the  latter  choice  is  beneficial  in  terms  of  the 
accuracy,  while  the  speed  is  approximately  the  same.  Nonetheless,  the  errors  for  both  bases  are  of  the  same 
order,  while  the  RBF  implementation  is  slightly  simpler. 

Figure  5  illustrates  scaling  of  the  algorithm  with  the  problem  size.  In  this  test  N  sources  were  distributed 
randomly  in  the  unit  box  and  both  0  (with  zero  mean)  and  V4»  were  computed  at  the  source  locations. 
Here  only  the  time  for  the  “get”  paid  is  displayed.  For  comparison  we  also  plotted  the  wall  clock  time 
for  solution  of  the  same  non-periodic  problem,  where  the  FMM  with  the  same  pfmm  was  used.  (The 
discussion  of  GPU  times  is  below).  While  for  the  non-periodic  case  the  FMM  is  scaled  approximately  as 
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Table  3:  Performance  for  different  A  for  tolerance  etoi, 2  =  5-10  7  at  TV  =  216, 000,  pfmm  =  16. 


A 

Pmin 

T(ffet),s 

T\get)  ,,s 

(no .mean)  ’ 

rp(FMM)  s 

Nb 

TV  +  2L 

T^set\ s 

1.1 

80 

15.7 

14.6 

11.5 

782,131 

241,440 

150 

1.2 

44 

12.6 

11.2 

10.2 

1,015,037 

223,656 

5.31 

1.3 

34 

12.8 

11.1 

10.3 

1,291,247 

220,556 

1.27 

1.5 

25 

15.0 

12.3 

11.7 

1,983,665 

218,450 

0.32 

1.7 

19 

18.7 

14.7 

14.1 

2,887,393 

217,406 

0.13 

Table  4:  Performance  for  different  basis  functions  at  A  =  1.5,  p  =  16,  pfmm  =  12. 


TV 

Basis 

(-2 

T(^),  s 

T\9et)  ,,s 

(no .mean)  ’ 

T^set\ s 

27,000 

Spherical 

1.3(-5) 

1.67 

1.30 

0.08 

RBF 

2.6(-5) 

1.74 

1.35 

0.06 

216,000 

Spherical 

1.6(-5) 

12.3 

9.57 

0.08 

RBF 

2.6(-5) 

12.7 

9.95 

0.06 

O(N)  (with  some  small  TV  log  TV  addition  due  to  data  structures),  the  present  algorithm  is  scaled  sublinearly 
at  small  TV,  while  it  approaches  the  same  scaling  as  the  regular  FMM.  Qualitatively  this  can  be  explained  by 
O  ^  \/Nf,  (TV  +  2V)j  scaling  of  the  FMM  as  it  used  in  the  present  algorithm  for  periodization.  So  the  ratio 
of  the  FMM  time  in  the  present  algorithm  and  the  FMM  for  non-periodic  problem  can  be  estimated  as 


4mm  VNb(N  +  2L) 

-  r-~i  - 

rp{non)  ]y 

1  FMM 


2L(A)\1/2 

) 


(41) 


This  ratio  for  different  A  along  with  the  experimentally  measured  time  ratio  of  the  “get”  paid  of  the 
algorithm  and  the  FMM  for  non-periodic  problem  is  plotted  in  Fig.  6.  It  is  seen  that  qualitatively  this 
explains  the  observed  results,  and  at  relatively  small  TV  the  wall  clock  time  of  the  present  algorithm  requires 
is  several  times  larger  than  the  time  of  non-periodic  FMM.  At  larger  times  the  ratio  should  come  to  an 
asymptotic  limit  depending  on  A.  While  the  theory  predicts  that  A  =  1.1  limit  should  be  smaller  than 
A  =  1.3  limit,  we  found  that  in  the  range  of  experiments  they  are  approximately  the  same  (time  ratio  about 
2).  So  one  can  say  that  the  periodization  overhead  at  large  TV  is  approximately  the  same  as  the  run  time 
of  the  FMM  for  the  non-periodic  problem.  The  reason  why  the  experimental  data  deviate  from  Eq.  (41)  is 
that,  first,  there  are  some  O(TV),  O(PN),  and  0(P2)  overheads  in  the  periodization  algorithm,  and,  second, 
that  the  optimization  of  the  FMM  can  be  done  only  discretely  (changing  the  number  of  the  levels  of  the 
octree).  Indeed,  we  checked  the  profiling  of  the  FMM  for  both  cases,  and  found  that  while  this  algorithm 
is  performing  at  its  minimum  time  for  a  given  distribution  and  pfmm,  the  parts  of  the  FMM  related  to 
translations  and  direct  summation  were  not  balanced  perfectly  (several  times  difference). 

Finally  we  note  that  it  is  not  an  easy  task  to  compare  the  absolute  performance  of  the  present  meth¬ 
ods  with  the  smooth  particle  mesh  Ewald  (SPME)  and  other  algorithms  for  periodic  summation  due  to  a 
difference  in  implementation,  accuracy,  what  is  actually  computed  (potential  and  gradient),  hardware,  etc. 
However,  some  comparison  with  that  approaches  can  be  done  using  published  data  [14]  comparing  perfor¬ 
mance  of  the  SPME  and  FMM-type  PWA  implementation  for  clusters,  for  relatively  small  size  problems 
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Figure  5:  Wall  clock  times  of  the  present  algorithm  with  the  FMM  as  FSA  for  A  =  1.3,  p  =  34  (filled 
squares),  and  with  brute-force  4  core  CPU  summation  (filled  triangles),  and  Tesla  C2050  GPU  summation 
(filled  circles),  for  a  periodic  system  replicating  distribution  of  N  random  sources  in  a  unit  cube  (potential 
and  gradient  computations,  Laplacian  kernel  in  three  dimensions).  The  corresponding  empty  squares,  tri¬ 
angles  and  circles,  connected  by  a  solid  line  show  performance  of  the  particular  FSA  for  the  non-periodic 
problem  with  the  same  source  distribution  in  the  unit  cube  and  the  same  truncation  number  pfmm  =  16. 
The  dashed  lines  continue  the  trendlines  to  indicate  performance,  if  memory  resources  had  not  been  ex¬ 
ceeded.  The  quadratic  scaling  of  the  brute-force  summation,  and  the  linear  scaling  for  the  FMM  arc  seen. 
Brute  force  GPU  sums  for  this  case  arc  faster  till  about  N  ~  40, 000. 


(N  =  104  and  N  =  105).  The  absolute  figures  indicate  that  the  wall  clock  time  of  the  present  algorithm 
for  these  sizes  is  of  the  same  order  as  the  reported  times  for  those  methods,  while  we  arc  able  to  perform 
computations  with  larger  problem  sizes  on  relatively  modest  hardware. 

4.3  Using  graphics  processing  units  (GPUs)  for  summation 

GPUs  arc  often  used  to  accelerate  molecular  dynamics  simulations  where  periodization  may  be  required. 
We  implemented  the  “get”  part  of  the  algorithm  completely  on  the  GPU.  Using  “brute-force”  summation  on 
the  GPU  and  on  a  multicore  CPU  as  the  FSA,  we  did  the  same  performance  tests  as  for  the  FMM,  see  Fig. 
5.  The  ’’set”  paid  can  be  executed  on  CPU  and  transferred  to  the  GPU.  All  parts  of  the  algorithm  arc  highly 
parallel  izablc  and  so  while  the  scaling  of  the  algorithm  is  quadratic,  it  is  up  to  two  orders  of  magnitude  faster 
than  the  multicore  CPU  version. 

Our  tests  reveal  that  at  large  truncation  numbers  p  (typically  p  >  30)  single  precision  GPU  compu¬ 
tations  cannot  be  used  for  spherical  basis  function  evaluation  due  to  loss  of  precision  in  least-squares  so¬ 
lution.  Double -precision  computations  provide  accurate  results  with  L'2-norm  relative  errors  of  the  order 


18 


Figure  6:  The  ratio  of  the  solution  times  of  the  periodic  and  non-periodic  problems  using  the  FMM  for  the 
cases  shown  in  Fig.  5.  The  dotted  and  the  dashed  curves  show  theoretical  estimate  of  the  FMM  time  ratio 
for  A  =  1.1  and  A  =  1.3,  respectively,  according  to  Eq.  (41). 


1(F  13  —  HP15  in  potential  and  gradient  computations  compared  to  the  double  precision  CPU  computations. 
Accordingly  Fig.  5,  only  shows  double  precision  times.  High  performance  implementation  of  the  brute- 
force  summation  is  described  in  [16]  and  was  used  in  the  present  tests  with  single  and  double  precision,  and 
run  on  a  single  NVIDIA  Tesla  C2050  card.  The  CPU  wall  clock  times  used  for  comparisons  were  measured 
for  algorithm  parallelized  for  4  core  PC  described  before. 

Fig.  5  shows  the  results  of  tests.  It  is  seen  that  despite  the  asymptotic  quadratic  scaling  of  the  algorithm 
the  GPU  implementation  can  be  faster  or  comparable  in  speed  with  the  FMM  running  on  CPU  at  pfmm  = 
16  for  problems  of  size  N  <  105.  The  ratio  of  times  for  solution  of  periodic  and  non-periodic  problems  for 

brute  force  computations  at  large  N  tends  to  theoretical  limit  =  7t^/|A3,  which  for  A  =  1.3 

used  for  illustrations  is  about  6  times.  Of  course,  this  ratio  can  be  reduced  by  decreasing  A.  However,  in  the 
range  N  <  105  reduction  of  A  below  1.1  does  not  benefit  the  overall  performance  due  to  increase  of  the  size 
of  the  expansion  and  reduction  of  the  performance  of  evaluation  of  the  far  field  on  the  GPU,  where  the  local 
memory  is  substantially  smaller  than  the  CPU  cache.  One  of  the  advantages  of  brute  force  double  precision 
GPU  computing  for  problems  of  relatively  small  size  (N  <  105)  is  that  besides  the  roundoff  errors  the 
accuracy  is  controlled  only  by  parameters  p  and  A,  while  for  the  FSA=FMM  the  error  is  controlled  also  by 
Pfmm ■  For  efficient  FMM  implementations  on  GPU  this  number  is  usually  small  [16]  while  the  efficiency 
of  the  FMM  on  GPU  for  high  accuracy  simulations  is  a  subject  for  a  separate  study. 

5  Conclusion 

We  have  presented  a  kernel-independent  method  for  the  periodization  of  finite  sums.  The  technique  was 
presented  in  a  general  setting,  and  then  applied  to  the  particular  case  of  the  Laplacian  kernel  using  different 
expansion  bases.  Tests  showed  that  the  method  can  be  tuned  to  compute  periodic  sums  with  arbitrary 
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prescribed  accuracy.  In  the  case  of  use  of  the  FMM  as  the  fast  summation  algorithm  the  complexity  of  the 
method  at  large  N  is  the  same  as  the  FMM.  The  computational  time  for  large  N  (in  tests  up  to  N  =  223) 
is  about  twice  that  for  the  finite  box  sum  using  the  FMM.  Similar  results  are  seen  for  GPU  based  FSA, 
though  here  the  scaling  is  quadratic,  and  the  largest  problem  size  that  can  be  reasonably  treated  is  about  105. 
The  ease  of  implementation  of  the  periodization  method,  its  performance,  and  capability  to  “retrofit”  any 
available  black  box  FSA  without  any  modification  makes  it  practical.  This  method  may  also  be  valuable  on 
distributed  architectures  on  which  communication  costs  of  an  algorithm  arc  as  important  as  computational 
complexity.  FMM-based  approaches  arc  known  to  be  much  more  communication  efficient  than  FFT-based 
approaches  for  solution  of  the  same  large  problems  on  distributed  architectures.  Additional  speedups  of  the 
method  can  be  achieved  by  specialization  of  the  FMM  -  these  were  specifically  avoided  in  this  paper  to 
demonstrate  the  ability  to  use  a  blackbox  sum  algorithm,  and  should  be  investigated  if  the  method  is  to  be 
used  in  a  “production”  environment.  Application  to  other  kernels  should  be  straightforward,  though  details 
will  have  to  be  worked  out  for  them. 
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A  Box  integrals  of  the  basis  functions 


Basis  functions  R ™  (y)  arc  homogeneous  polynomials  of  degree  n  (sums  of  monomials  xn  1  y”2 c"3 ,  m  + 
ri2  +  «:>  =  1).  We  can  compute  the  integrals  as 


iU(0)  = 


fdi/2  rd2/2  fd3/2 


l-di/ 2  J-d2/2  J-d3/ 2 


%LMy)dv(y)=ihkL 

Nn  No  Nq 

1  9  q  q  / 1  ^  1 
sEEE  WiWjWkRt  (  -diXi,  - d2Xj ,  -d3xk 


Rt  (x, y, z)  dxdydz 


(42) 


i= 1  j= 1  k= 1 


where  w,  and  x%  arc  the  standard  weights  and  abscissas  of  the  Gauss  quadrature  of  order  Nq  [29].  Since  this 
integration  is  exact  for  polynomials  of  degree  n  <  2 Nq  and  the  maximum  degree  of  the  polynomials  in  the 
sum  is  p  —  1 ,  the  choice 


Nq  = 


p  —  1 


+  1,  P  =  P1/2, 


(43) 


^  '^2 ( Q) 

provides  an  exact  result.  For  evaluation  of  all  P  required  coefficients  Rn  y  the  computational  cost  of  this 
procedure  is  O  (P5/2)  .  Note  that  faster  methods  may  be  proposed  for  computation  of  this  step.  However, 
this  is  not  crucial,  as  this  integration  is  performed  in  the  “set”  paid  of  the  algorithm,  which  overall  cost  is 
O  (P3),  and  this  cost  is  amortized  over  the  rest  of  the  algorithm. 
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B  Mean  computations 


To  compute  the  integral  (38)  for  the  kernel  (2),  we  first  apply  the  Gauss  divergence  theorem  to  reduce  the 
volume  integral  to  a  surface  integral: 


$o  (x) 


1  f  dv  (y) 

Vo  Ja.0  |y  -  xl 


1 

2Vb 


y  -  x 

|y  -  x| 


dV(  y) 


2  Vo  Idfi0 


n  •  (y  -  x) 

|y  -  x| 


dS  (y) ,  (44) 


where  n  is  the  outward  normal  to  Q0.  This  result  is  valid  for  an  arbitrary  point  x  including  when  x  is  located 
in  Go  or  on  its  boundary  <9Go-  This  can  be  checked  by  consideration  of  e-vicinities  of  singularities,  which 
arc  integrable.  The  surface  integral  can  be  decomposed  into  integrals  over  the  box  faces,  Sk,  k  =  1, 6. 
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y k  -  xfc 


yfc  =  y  -  yfc0> 


1  x  A 
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Xfc  —  x  yfc0, 


(45) 


where  n and  y^o  are  the  normal  and  the  center  of  the  ktl l  face,  while  yk  and  x/,.  are  coordinates  in  the 
reference  frame  with  the  origin  at  the  kth  face  center.  The  surface  integral  then  can  be  reduced  to  the 
contour  integral  using,  e.g.  the  Gauss  divergence  theorem  in  the  plane  of  a  particular-  face.  Indeed,  consider 
function 


Ffc  (rfc) 

r  k 


r  kfk(rk]hk),  fk  ( rk;hk )  =  Pk  2  hk  ,  Pk  =  Jrl  +  hl  (=  I  yfc  -  xfc|)  * 

1  k 

yfc -x'fc,  Xfc  =  Xfc -n khk,  /ifc  =  nfc  •  Xfc. 


(46) 


The  2D  divergence  of  this  function  in  the  plane  of  the  kth  face  is  1  / pk.  So 


Ffc(xfc)=/  Vrfc-Ffc(rfc)dS(rfc)=  [  n'k  ■  Fk  (rfc)  dl  (rfc) ,  (47) 

where  n^.  is  the  outer  normal  to  the  contour  Ck  =  dilk.  This  integral  can  be  decomposed  into  four  integrals 
over  the  face  edges.  So 

4  r  -  h 

Lk  (Xfc)  =  ^2  hj  (xfc) ,  Ikj  (Xfc)  =  /  n'kj  ■  rkPk  2  k  dl.  (48) 

j= i  "'Gcj  rfc 

The  latter  integrals  can  be  found  analytically.  Indeed,  consider  for  the  jth  edge  a  local  right  hand  oriented 
reference  frame  centered  at  its  endpoint  y kjo  from  which  integration  starts,  and  unit  basis  vectors  i'k-x 
directed  along  the  integration  path,  i'kjy  =  nk  and  i'kjz  =  n'kj  =  i'kjx  x  i'kjy.  Denoting  coordinates  of  x  in 
this  reference  frame  as 


xkj  —  (xfc  Ykjo)  ■  i kjxi  Vkj  —  (Xfc  yfcjo)  ■  ifcjj,)  zkj  —  (Xfc  yfcjo)  ■  ifcjz) 
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~%kj 

where  rjV  =  x2  +  zk:) ,  lkj  is  the  length  of  edge  Ckj,  and  77  (x,  y,  z)  is  the  primitive, 


H  (x,y,z)  =  -z  /  /(r;|y|)dx,  f(r;y)  = 


=  x2  +  z2,  p=  \Jr2  +  y2,  (51) 
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which  can  be  computed  analytically  as 


H  (, x,y,z ) 
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(  x 

arctan - arctan 

V  * 


z  In  \p  +  x\  +  C(y,z) . 


(52) 


The  integration  constant  C  (y,  z )  can  be  selected  arbitrarily  to  eliminate  possible  singularities.  Particularly 
for  y  7^  0,  z  =  0  one  can  set  H  =  0.  The  above  formulae  are  sufficient  for  numerical  implementation,  which 
in  the  simplest  form  can  program  the  primitive  (52)  and  implement  the  above  decompositions.  There  exist 
some  box  symmetries  (e.g.  all  local  coordinates  are  nothing  but  permuted  and  shifted  original  Cartesian 
coordinates),  which  can  be  exploited  to  achieve  better  performance. 


C  Ewald  summation 


The  Ewald  summation  method  is  based  on  decomposition  of  kernel  (2) 
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2£ 


,  y  /  x;  I<\  (0;  £)  =  — =, 
y  -  x|  V71- 


erf  (£  |y  ~  x| 
|y  -  x| 


Vy,x  e 


I<2  (0;  0  =  ^  )  , 

7 r  / 


(53) 


which  is  exact  for  any  value  of  parameter  £,  since  by  definition  of  the  error  function,  erf(x),  and  the  com¬ 
plimentary  error  function,  erfc(x),  we  have  erf(x)  +erfc(x)  =  1  and  the  value  of  K\  (0;£)  is  set  due  to  by 
definition  K  (0)  =  0.  So  for  the  total  potential  (2)  we  have 


N 

<t>  (y)  =  <t>\  (y)  +  <^2  (y) ,  <t>i  (y)  =  EE  QiKi  (y  -  x,:  +  p;  0  , 

p  i=  1 

N 

02  (y)  =  ^2  qiK 2  (y  -  x*  +  p;  0  ■  (54) 

p  i=i 

Both  functions  4>1  (y)  and  <i>2  (y)  arc  periodic. 

Due  to  fast  decay  of  erfc(x')  computation  of  4>1  (y)  for  y  6  Do  can  be  done  only  using  the  sources  in 
some  neighborhood  of  Do,  namely  in  Di  D  Do  such  that  the  minimum  distance,  a,  between  the  points  on 
the  boundaries  <9Do  and  0Q  i  is  much  larger  than  l/£.  Hence,  this  can  be  computed  directly  by  evaluation 
of  a  finite  sum  with  a  controllable  error  as 


0i  (y)  =  H  qiKi  (y  _  xi;  0  +  °  (e  "c  “2)  • 


(55) 


For  computation  of  (f>2  (y)  one  can  notice  that  I\  2  is  a  solution  of  the  Poisson  equation 

V2K-2  (y  -  x;  0  =  (y  -  x) ,  (y  -  x)  =  4^e_e|y”x|2,  (56) 

where  5 £  (y  —  x)  is  a  compactly  supported  function,  which  turns  to  the  Dirac  delta-function  as  £  — >  oo. 
Periodic  solution  of  the  Poisson  equation  can  be  obtained  via  the  FFT.  For  this  purpose,  we  grid  the  domain 
Do  and  select  £  in  a  way  that  £  <C  1  /h,  and  £  S>  1/ max  (d\ .  d2,  e/3)  (an  optimal  setting  can  be  found  from 
analysis  of  the  error  bounds),  where  h  is  the  minimum  spatial  step  of  the  grid.  This  enables  sampling  of 
5^  (y  —  x)  for  source  x  =  x*  at  several  grid  points  around  x,.  The  number  of  these  grid  points  determines 
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the  accuracy  of  the  method  (at  optimal  settings),  so  we  introduce  additional  parameter  Nr,  so  5%  (y)  is 
sampled  in  a  box  ( 2Nr  +  1)  x  ( 2Nr  +  1)  x  ( 2Nr  +  1).  We  also  take  care  about  the  points  x,;  located  near 
the  boundary  of  Qq  by  periodization  (so  we  construct  a  periodic  function  8^  (y  —  x)).  Further,  we  apply 
the  forward  3D  FFT  to 

N 

f'2  (y)  =  V2</>2  (y)  =  -4tt  ^2  qi5 ^p)  (y  -  x*) ,  (57) 

i=  1 

and  zero  the  harmonic  of  the  Fourier  image  /|  (k)  corresponding  to  the  wavenumber  k  =  0.  The  inverse  3D 
FFT  of  <f>2  (k)  =  — A;~  2/|  (k),  produces  the  required  solution  (j)2  (y)  with  zero  mean  at  grid  points.  Note 
then  that  solution  obtained  in  this  way  has  the  following  mean 

(Pmean  (0  =  (</>  (y))n0  =  TF  5Z  V  [  Kl  ^  ~  dV  ~  °' 

The  zero  mean  here  is  due  to  the  compact  support  of  the  kernel  K\  and  charge  neutrality.  This  mean  can  be 
computed  using  decomposition  I\\  (y  —  x.;T)  =  K  (y  —  x^;  ^)  —  K2  (y  —  x; ;  <( ) ,  where  the  integral  with 
the  first  kernel  can  be  computed  analytically  (see  Appendix  B),  while  the  integral  with  the  second  kernel  is 
regular  and  can  be  computed  using,  say,  the  trapezoidal  rule  (in  the  FFT-based  method  the  space  is  gridded). 
To  avoid  interpolation  errors,  in  the  numerical  tests  where  we  compared  our  method  for  accuracy  with  the 
Ewald  summation  method,  we  used  only  cases  when  the  source  and  evaluation  points  arc  located  at  the  grid 
nodes. 
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